Introduction to Open Data Science - Course Project

About the project


I am looking forward to taking part in this course. To be honest, I felt a bit intimidated at first, as I’ve only recently started using R. I have never done any coding before and have only used SPSS as my statistics software. However, I believe that mastering R and learning the basics of data science is crucial if one wishes to pursue any sort of future in academia.

I hope that this introduction to data science will help me learn R and get a grasp of what data science is. I hope to be able to incorporate the lessons from this course into my own PhD work.

My repository: https://github.com/OlanderRasmus/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Nov 25 10:06:44 2020"

Chapter 2

Regression and model validation

date()
## [1] "Wed Nov 25 10:06:44 2020"

We will analyse a dataset consisting of answers to a survey on students’ attitudes towards learning.

We begin by setting our working directory, reading in the data file and exploring the dimensions and structure of the dataset.

#We'll first set the working directory.
setwd("~/Desktop/IODS2020/IODS-project")

#And then check that it is correct.

getwd()
## [1] "/Users/rasmusolander/Desktop/IODS2020/IODS-project"
#We'll then read the file and examine it.
learning2014 <- read.table(file = "learning2014.txt")

dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

As we can see, the data set consists of 166 observations of 7 variables. The column “gender” gives the respondent’s gender (M = male, F = female) and the column “Age” their age in years. The column “Attitude” describes the respondent’s attitude towards statistics and the column “Points” their total points in an exam. The columns “deep”, “stra” and “surf” give the mean scores on questions about deep, strategic and surface learning, respectively.

Let’s check the summary of the table to see how the data is distributed.

summary(learning2014)
##     gender               Age           Attitude         Points     
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   : 7.00  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00  
##  Mode  :character   Median :22.00   Median :32.00   Median :23.00  
##                     Mean   :25.51   Mean   :31.43   Mean   :22.72  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75  
##                     Max.   :55.00   Max.   :50.00   Max.   :33.00  
##       deep            surf            stra      
##  Min.   :1.583   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.333   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.667   Median :2.833   Median :3.188  
##  Mean   :3.680   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.083   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.917   Max.   :4.333   Max.   :5.000

Next we will examine the data visually. We’re interested in which variables might affect exam points, but we’ll start by examining the whole data set.

#We first load the libraries GGally and ggplot2.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

#And then draw a scatter plot matrix (p1).

p1 <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p1

Based on this, Points seems to correlate most strongly (and positively) with Attitude (correlation coefficient 0.437), followed by strategic learning (0.146); it correlates negatively with surface learning (-0.144).

We’ll create a linear model including all three of these explanatory variables and examine it.

my_model <- lm(Points ~ Attitude + stra + surf, data = learning2014)

summary(my_model)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The model predicts linearly that Points = 11.01711 + 0.33952 * Attitude + 0.85313 * mean strategic score - 0.58607 * mean surface learning score.
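As a quick sanity check, the fitted equation can be evaluated by hand and compared against predict(). A minimal sketch; the covariate values for the hypothetical student below are made up purely for illustration.

```r
# Sketch: apply the fitted equation by hand and compare with predict().
# The values in new_student are hypothetical, chosen only for illustration.
b <- coef(my_model)
new_student <- data.frame(Attitude = 35, stra = 3.5, surf = 2.5)
by_hand <- b["(Intercept)"] + b["Attitude"] * 35 + b["stra"] * 3.5 + b["surf"] * 2.5
unname(by_hand)
predict(my_model, newdata = new_student)  # should match the manual value
```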

The model has an R^2 of 0.2074 (i.e. it explains 20.74 % of the variance in Points) and is in itself significant (p < 0.05); however, neither the mean strategic score nor the mean surface score is significant. Let’s remove these.

new_model <- lm(Points ~ Attitude, data = learning2014)

summary(new_model)
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now the model has an R^2 of 0.1906 and is still significant (p < 0.05). The R^2 value means that it is able to explain 19.06 % of the variance in the points.

Next we’ll check the diagnostics, i.e. verify that the assumptions of linear regression have not been violated. This is easiest to do visually.

plot(new_model, which = c(1, 2, 5))

The first plot shows the residuals vs the fitted values, i.e. how far each data point lies from the model’s prediction. The spread of the residuals should be roughly constant across the fitted values (homoscedasticity). Here, the residuals are evenly distributed throughout the range of predicted values.

The second plot shows how the residuals are distributed. The more closely they follow the Q-Q line, the more normal their distribution. Here, the residuals are sufficiently normal.

The third plot shows the residuals vs leverage. The further to the right a point lies, the more leverage it has, i.e. the more it affects the model. This is useful for detecting influential outliers. The model does not suffer from significant outliers.


Chapter 3

Logistic regression

date()
## [1] "Wed Nov 25 10:06:50 2020"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
  1. Let’s read in the data and print out the names of the variables.
alc <- read.csv(file= "data/alc.csv", header = T)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset describes student achievement in secondary education at two Portuguese schools and includes, among other things, performance, absences and alcohol use. The dataset combines performance data from two separate courses, Portuguese and mathematics.

  2. Analysing the effect of alcohol use.

I will be looking into the effect of alcohol use on time spent studying (‘studytime’), number of failures (‘failures’), absences (‘absences’) and final grade (‘G3’). The hypothesis is that higher alcohol use is negatively associated with study time and final grade, and positively associated with the number of failures and absences.

  3. Exploring the distributions

It is good to first examine the data graphically. Alcohol use is a numeric variable ranging from 1 to 5, time spent studying an ordinal variable on a range from 1 to 4, the number of failures an ordinal variable, absences numeric (the number of absences) and the final grade numeric as well, ranging from 0 to 20.

p1 <- ggplot(alc, aes (x=alc_use, y = studytime)) + geom_bar(stat="identity")
p1

p2 <- ggplot(alc, aes (x=alc_use, y = failures)) + geom_bar(stat="identity")
p2

p3 <- ggplot(alc, aes (x=alc_use, y = absences, group = alc_use)) + geom_boxplot()
p3

p4 <- ggplot(alc, aes (x=alc_use, y = G3, group = alc_use )) +  geom_boxplot()
p4

A trend in study time is difficult to spot from the graph, and the same goes for failures. However, the number of absences seems to increase with increasing alcohol consumption, while the final grade decreases.

Let’s look at the means per category of alcohol use.

alc %>%
  group_by(alc_use) %>%
  summarise(st = mean(studytime), fail = mean(failures),  ab = mean(absences), grade = mean(G3))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 5
##   alc_use    st  fail    ab grade
##     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1    2.31 0.107  3.36  12.0
## 2     1.5  1.97 0.145  4.23  11.6
## 3     2    1.98 0.220  3.92  11.3
## 4     2.5  1.93 0.136  6.43  11.7
## 5     3    1.72 0.562  6.09  10  
## 6     3.5  1.47 0.471  5.65  10.2
## 7     4    1.78 0.222  6     10.2
## 8     4.5  2    0     12     10.7
## 9     5    1.67 0.556  6.89  10.9

In addition to the trends noted earlier in absences and grades, failures seem to increase with increasing alcohol consumption, while study time decreases.

  4. Logistic regression

Next, let’s explore the relationship between high alcohol use and study time, failures, absences and grade, and compute odds ratios with 95 % CIs.

m1 <- glm(high_use ~ studytime + failures + absences + G3, data = alc, family = "binomial")

summary(m1)
## 
## Call:
## glm(formula = high_use ~ studytime + failures + absences + G3, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1570  -0.8393  -0.6605   1.1103   2.1227  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.02933    0.56149   0.052 0.958346    
## studytime   -0.47423    0.15920  -2.979 0.002893 ** 
## failures     0.29443    0.20422   1.442 0.149383    
## absences     0.07766    0.02271   3.420 0.000626 ***
## G3          -0.03519    0.03868  -0.910 0.362964    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 429.15  on 377  degrees of freedom
## AIC: 439.15
## 
## Number of Fisher Scoring iterations: 4
coef(m1)
## (Intercept)   studytime    failures    absences          G3 
##  0.02932628 -0.47423143  0.29442757  0.07765857 -0.03518531
OR1 <- coef(m1) %>% exp

CI1 <- confint(m1) %>% exp
## Waiting for profiling to be done...
cbind(OR1, CI1)
##                   OR1     2.5 %    97.5 %
## (Intercept) 1.0297605 0.3412502 3.1040114
## studytime   0.6223632 0.4513693 0.8437505
## failures    1.3423577 0.9000830 2.0158575
## absences    1.0807536 1.0359926 1.1326415
## G3          0.9654265 0.8948227 1.0417846

By looking at the summary of the model, we can see that failures and G3 are not significant predictors. This becomes even clearer when examining the ORs and their CIs. The ORs and their 95 % CIs are as follows: studytime 0.62 (0.45-0.84), failures 1.34 (0.90-2.02), absences 1.08 (1.04-1.13) and G3 0.97 (0.89-1.04).

Both failures and G3 include one within their 95 % CI and are not significant predictors, while studytime and absences are.

Thus our hypothesis that increased alcohol consumption is related to an increase in failures and a decrease in G3 is not necessarily true, while there does seem to be an association between increased alcohol use and both decreased time spent studying and increased absences.

  5. Let’s create a new model and see its predictive value. First, let’s check the ORs and their 95 % CIs.
m2 <- glm(high_use ~ studytime + absences, data = alc, family = "binomial")

summary(m2)
## 
## Call:
## glm(formula = high_use ~ studytime + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2128  -0.8387  -0.7046   1.1996   2.1832  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.16640    0.33954  -0.490 0.624087    
## studytime   -0.55015    0.15550  -3.538 0.000403 ***
## absences     0.08054    0.02285   3.524 0.000425 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 433.65  on 379  degrees of freedom
## AIC: 439.65
## 
## Number of Fisher Scoring iterations: 4
coef(m2)
## (Intercept)   studytime    absences 
## -0.16639583 -0.55015181  0.08054463
OR2 <- coef(m2) %>% exp

CI2 <- confint(m2) %>% exp
## Waiting for profiling to be done...
cbind(OR2, CI2)
##                   OR2     2.5 %    97.5 %
## (Intercept) 0.8467110 0.4349166 1.6508083
## studytime   0.5768622 0.4211716 0.7758799
## absences    1.0838772 1.0386751 1.1361111

The model looks good, with both studytime and absences as significant predictors. Now, let’s move on to prediction. Let’s first examine the predictions using cross-tabulation and by plotting the values.

probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability= probabilities)
alc <- mutate(alc, prediction = probability >0.5)

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   256   12
##    TRUE     96   18
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67015707 0.03141361 0.70157068
##    TRUE  0.25130890 0.04712042 0.29842932
##    Sum   0.92146597 0.07853403 1.00000000

The model predicts 256 cases correctly as false and 18 correctly as true, while predicting 96 cases as false that are in reality true, and 12 as true that are in reality false. This can be seen graphically as well.
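The cross-tabulation above can also be condensed into the usual classification rates. A sketch, assuming the alc data frame with the prediction column created above:

```r
# Sketch: classification rates from the confusion matrix above
cm <- table(high_use = alc$high_use, prediction = alc$prediction)
accuracy    <- sum(diag(cm)) / sum(cm)                    # (256 + 18) / 382
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])     # share of true high users caught
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])  # share of non-high users correct
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The low sensitivity (18 of 114 high users found) is where the model struggles, as the plot also suggests.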

Let’s further examine the model using the penalty/loss function.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2827225

Based on the loss function we can see that the model incorrectly predicts about 0.28 of the data. Guessing at random would give a loss of about 0.50, indicating that our model is better than pure guessing. Still, 0.28 is quite a large error.
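The training-set loss above is likely optimistic; the same loss can also be cross-validated. A sketch using boot::cv.glm, assuming the boot package is installed:

```r
# Sketch: 10-fold cross-validation of the average prediction error
library(boot)
cv <- cv.glm(data = alc, glmfit = m2, cost = loss_func, K = 10)
cv$delta[1]  # cross-validated loss, comparable to the 0.28 above
```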


Chapter 4

date()
## [1] "Wed Nov 25 10:06:52 2020"
library(dplyr)
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

Task 2

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset consists of 506 rows and 14 columns and describes housing values in the suburbs of Boston, published by Harrison, D. and Rubinfeld, D.L. in 1978 and used by Belsley et al. in 1980. The data contain one row per town, with variables including per capita crime rate, zoning, industry, demographics and housing.

Task 3

I chose to examine the dataset using a correlation plot and the summary of the data.

library(corrplot)
## corrplot 0.84 loaded
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
cor_matrix <- cor(Boston) %>% round (2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

From the correlation plot and matrix we can see, for instance, that the index of accessibility to radial highways (rad) is highly correlated with the full-value property-tax rate per $10,000 (tax), with a coefficient of 0.91. The weighted mean distance to five Boston employment centres (dis) is strongly negatively correlated with the proportion of owner-occupied units built prior to 1940 (age), nitrogen oxide concentration in parts per 10 million (nox) and the proportion of non-retail business acres per town (indus), with coefficients of -0.75, -0.77 and -0.71, respectively. The summary view shows that the ranges of the variables differ greatly. Note that chas (whether the tract bounds the Charles River) is a dummy variable, so its summary statistics are not very informative.

Task 4

Part 1: scaling the dataset and saving it as a data frame.

boston_scaled <- scale(Boston)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)

From the summary, we can see that each variable is now centered on its own mean (so all means are 0) and scaled by its own standard deviation.
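As a small check of what scale() actually does (subtract the column mean, divide by the column standard deviation), the transformation can be reproduced by hand for one variable:

```r
# Sketch: reproduce scale() manually for the crim column
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, boston_scaled$crim)  # should be TRUE
```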

Next, we’re creating a categorical variable from the crime rate and dropping the old crime rate variable from the dataset. I personally find this part somewhat confusing, which is why I kept the comments and code from DataCamp intact.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins , include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Finally, we’re dividing the dataset into training and testing sets, with 80 % of the data in the train set.

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Task 5

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2500000 0.2351485 0.2698020 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.09618215 -0.9579728 -0.07348562 -0.9125882  0.47953750 -0.9263576
## med_low  -0.08606302 -0.2723132 -0.03844192 -0.5508797 -0.08009761 -0.3473981
## med_high -0.38794577  0.1550573  0.22498886  0.3884286  0.08169057  0.4082747
## high     -0.48724019  1.0169738 -0.05560795  1.0644916 -0.41829387  0.8152711
##                 dis        rad        tax    ptratio       black       lstat
## low       1.0004359 -0.6883741 -0.7285796 -0.4670275  0.37628555 -0.80148690
## med_low   0.2834409 -0.5611457 -0.4505840 -0.0732138  0.30789178 -0.15712974
## med_high -0.4182925 -0.4293981 -0.3377379 -0.2532004  0.07756901  0.02312643
## high     -0.8540894  1.6395837  1.5150965  0.7824713 -0.77952971  0.94875277
##                 medv
## low       0.54919810
## med_low   0.01064321
## med_high  0.15678093
## high     -0.74612226
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3
## zn       0.13026928  0.639082660 -0.897140865
## indus    0.05396296 -0.201943608  0.308092160
## chas    -0.09110441 -0.063498643  0.036321413
## nox      0.38464976 -0.605902021 -1.483675382
## rm      -0.13910738 -0.012398559 -0.004036808
## age      0.19684403 -0.213225969 -0.329878475
## dis     -0.07730313  0.094866660 -0.072903538
## rad      3.45441103  0.871348478 -0.177674951
## tax     -0.08278407  0.131413971  0.742573715
## ptratio  0.12630478  0.007397766 -0.318687015
## black   -0.10562998  0.028737677  0.153052294
## lstat    0.22566774 -0.162330931  0.341934368
## medv     0.16434807 -0.249436703 -0.325923537
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9529 0.0356 0.0115
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Task 6

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      15        2    0
##   med_low    3      18        4    0
##   med_high   1       6       22    2
##   high       0       0        0   18

Looking at the results, our model correctly predicted 11 cases as low, 18 as medium low, 22 as medium high and 18 as high, giving 69 correct predictions in total. However, the model also got 33 wrong, in particular predicting low and medium high cases as medium low. (The exact counts vary from run to run, as the train/test split is random.)
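The overall proportion of correct predictions can be read off the cross-tabulation programmatically. A sketch (the exact value varies between runs because the split is random):

```r
# Sketch: prediction accuracy from the cross-tabulation above
cm <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(cm)) / sum(cm)  # proportion of test cases classified correctly
```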

Task 7

Part 1: calculating the distances.

data("Boston")
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Part 2: the K-means algorithm, visualising it (not required), and investigating the optimal number of clusters. Note that the seed is set to 123, as K-means may produce different results on each run.

set.seed(123)
km <-kmeans(boston_scaled2, centers = 4)
pairs(boston_scaled2, col = km$cluster)

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal number of clusters is where the total WCSS drops radically. In this case it seems to be 2, as after that the slope of the curve flattens out.

Rerunning the clustering, with 2 clusters.

km <-kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2, col = km$cluster)

Examining the plots, the clustering seems to do all right for each variable. However, to look for closer trends, I would like to know what groups one typically looks for in this sort of data, and to compare the two predicted clusters against some sort of hypothesis.


Chapter 5

library(dplyr)
library(ggplot2)
library(GGally)
library(corrplot)
human <- read.csv("data/human.csv", row.names = 1)

Task 1.

Let’s examine the data graphically.

ggpairs(human)

cor_matrix <- cor(human)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

Examining the scatterplots, there seem to be strong linear associations between expected education and life expectancy, and between adolescent birth rate and maternal mortality. The association between life expectancy and GNI looks almost exponential. The strongest correlation overall is the one between maternal mortality and life expectancy, which is rather expected.

Let’s look at the summary of the data:

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

As we can see from the graphical distributions and the summaries, quite a lot of the data is rather skewed. Especially GNI, maternal mortality and adolescent birth rate have a rather large rightward tail, while life expectancy, education expectancy and the female-male ratio in the labour force all have leftward tails. This is confirmed by the summary view, as the mean and median of all the mentioned variables are quite far apart (especially so for GNI, maternal mortality and adolescent birth rate).
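The skewness noted above can be quantified crudely by comparing each variable’s mean to its median: positive differences indicate a rightward tail, negative differences a leftward one.

```r
# Sketch: mean minus median as a crude indicator of skew, per variable
sapply(human, function(x) mean(x) - median(x))
```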

Task 2.

pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.5))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Figure 1. Biplot of the principal component analysis on data from the UN human development program, unstandardised.


s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0

If we examine the biplot, we can see that gross national income per capita (GNI) completely dominates everything else. Its arrow lies along the PC1 axis, showing that PC1 is essentially GNI. Examining the proportion of variance, PC1 accounts for 99.99 % of the variance and PC2 for about 0.01 %, while the rest contribute essentially nothing; rounded, PC1 alone gives 100 %.
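This dominance is purely a matter of measurement scale: prcomp() maximises raw variance, so a variable measured in large units swamps the rest. A tiny simulated illustration (not the human data):

```r
# Two uncorrelated variables on wildly different scales; unstandardised
# PCA assigns essentially all of the variance to the large-scale one.
set.seed(42)
toy <- cbind(small = rnorm(100, sd = 1), big = rnorm(100, sd = 10000))
summary(prcomp(toy))$importance[2, ]  # proportion of variance per PC
```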

# Task 3.

Let’s do the same as above, but now standardising the data.
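For reference, scale() works column by column: it subtracts the column mean and divides by the column standard deviation. A quick sanity check on a toy vector:

```r
# scale() is (x - mean(x)) / sd(x), applied to each column.
x <- c(1, 2, 3, 4, 10)
cbind(scaled = as.vector(scale(x)),
      manual = (x - mean(x)) / sd(x))  # the two columns are identical
```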

human_std <- scale(human)
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human_std <- prcomp(human_std)
biplot(pca_human_std, choices = 1:2, cex = c(0.5, 0.5))
Figure 2. Biplot of the principal component analysis on the same data as above, but standardised.


s_std <- summary(pca_human_std)
s_std
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
pca_pr_std <- round(100*s_std$importance[2,], digits = 1)
pca_pr_std
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

The results differ markedly from those above. This is caused by the standardisation of the data, which removed the dominating effect of GNI and lets us see the effect of the other variables as well. Looking at PC1 and PC2, they now explain 53.6 % and 16.2 % of the variance, respectively. Now, for instance, we can see that GNI, expected education and life expectancy all “pull in the same direction”, and all are highly negatively correlated with maternal mortality and adolescent birth rate. This is to be expected: we might reasonably assume that people in countries with a higher gross national income per capita (GNI) have a higher life expectancy and are expected to be more highly educated, while the opposite holds for countries with higher maternal mortality and adolescent birth rates. We can also see the logic behind adolescent birth rate and maternal mortality being related, as both can plausibly be associated with poorer access to healthcare services for women.
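Incidentally, prcomp() can do the standardisation itself via its scale. argument, which gives the same result as calling scale() first. Checked here on the built-in iris measurements rather than the human data:

```r
# PCA on pre-scaled data vs. prcomp's own scaling: identical components.
a <- prcomp(scale(iris[, 1:4]))
b <- prcomp(iris[, 1:4], scale. = TRUE)
all.equal(a$sdev, b$sdev)  # TRUE
```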

# Task 4.

My interpretation is that GNI is such a dominating factor that it hides everything else within it. On its own it explains nearly all of the variance, as seen in Task 2, where it was parallel to PC1, which accounted for more than 99.9 % of the variance. However, after standardising we can see that it is not actually the greatest source of variation: its arrow is shorter than any of the other arrows.

# Task 5.

library("FactoMineR")

data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

The tea dataset consists of 300 observations of 36 variables, based on a questionnaire covering how individuals drink tea (18 questions), their perception of the product (12 questions) and personal details (4 questions). In addition, age has been divided into classes. I was unable to find information on what the 36th variable would be, but I assume it is the frequency of drinking tea.

I’ll examine the columns containing the type of tea (“Tea”), whether tea is drunk alone or with milk or lemon (“How”), whether it is made from tea bags, unpackaged tea, or both (“how”), the use of sugar (“sugar”), where the tea is bought (“where”), and whether it is drunk at lunch or not (“lunch”).

library(dplyr)
library(tidyr)
library(ggplot2)
library(FactoMineR)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))

summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

The data is visualised and summarised above. We can see that the observations are split into three groups for “how”, “Tea” and “where”, two groups for “lunch” and “sugar”, and four groups for “How”. (The warning from gather() just notes that the factor attributes differ across columns and are dropped; it is harmless here.)

Next, we’ll do an MCA on the data.

# multiple correspondence analysis and visualize
mca <- MCA(tea_time, graph = T)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = T) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

We’ll start by examining the summary of the MCA. Notably, from the eigenvalues table we can see that dimensions 1 and 2 together explain around 29 % of the variance, with dimension 1 explaining about 15 % and dimension 2 about 14 %. Looking at the table of categorical variables (eta2), “how” and “where” are the variables most strongly linked to dimensions 1 and 2, while “Tea” and “How” dominate dimension 3.

The variable representation plot shows the same information graphically, and we can indeed see that “how” and “where” behave quite similarly on dimensions 1 and 2. Thus we can presume that whether tea is bought in a tea shop or a supermarket, and whether it is made from tea bags or unpackaged tea, is what tells our groups apart.

Finally, examining the MCA factor map for the variables, we can see that the people who buy their tea unpackaged are likely to buy it from a tea shop, and that this tea is quite often green tea. Similarly, tea bags and chain stores go together, and the people who buy both tea bags and unpackaged tea shop at both tea shops and chain stores. Quite intuitive! We’ll leave the MCA factor map for the individuals as it is, since it doesn’t really tell us that much about the whole.
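For a cleaner look at the category map, FactoMineR’s plot method can hide the individual points. A sketch, assuming the mca object fitted above:

```r
# Redraw the MCA factor map showing only the variable categories;
# invisible = "ind" hides the 300 individual points, and
# habillage = "quali" colours each category by the variable it belongs to.
plot(mca, invisible = c("ind"), habillage = "quali")
```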